perm filename REPORT[4,ALS] blob sn#054414 filedate 1973-07-24 generic text, type T, neo UTF8
00010		Some Experiments on Speech Recognition
00100	
00200	A serious attempt has been made to apply some machine learning 
00300	techniques to the problem of speech recognition by machine.
00400	The general approach has been to develop techniques which will
00500	permit the computer system to adapt itself to the characteristics
00600	of the speaker by means of a training or learning procedure.
00700	
00800	This training procedure might be used to adapt the system to the average
00900	characteristics of a large number of speakers or it could be used to
01000	adapt the system separately to each individual speaker, if the
01100	number of different speakers was not too large. An alternative
01200	arrangement might be to have a system that was already adapted to
01300	the expected characteristics of the speaker, sufficiently well,
01400	at least, that it could do a partial job of understanding
01500	and could adapt itself to the speaker during the actual
01600	conversation or, failing in this, could request the speaker to
01700	repeat some simple text that would enable the system to identify
01800	the particular differences in speaker characteristics that were the
01900	cause of misunderstandings.
02000	
02100	While the early work here reported has been concentrated on the
02200	extraction and use of acoustic information from the speech,
02300	it should not be inferred that the methods are necessarily limited
02400	to this usage. Quite the contrary, the techniques can be equally well used
02500	to combine syntactic, semantic, linguistic, and cultural clues as to
02600	what is being said with the acoustic clues. The need for such
02700	additional information is by now well understood and early attempts
02800	at speech recognition were only marginally successful because of a
02900	failure to recognize this need.
03000	
03100	Four quite different systems have been investigated in some detail.
03200	They all have one characteristic in common in that information
03300	is contained in tables, the so-called Signature Tables, as to
03400	the relationships between the acoustic clues contained in the
03500	speech and the desired phonetic (and ultimately linguistic) output.
03600	They differ in number and size of the tables that are used and in
03700	the way the tables are interconnected. They also differ in the rigour
03800	with which the Bayesian probabilities are computed. The earliest
03900	scheme made no attempt at rigour at all but did everything in the
04000	simplest possible way. Subsequent schemes made fewer approximations
04100	and employed somewhat greater table sizes. The most recent scheme
04200	seems to be nearly optimum in terms of the degree of rigour employed
04300	and the sizes of tables involved.
04400	
04500	The program itself consists of a simple procedure for accumulating
04600	information as to the indicated relationships during training
04700	sequences in which known utterances with accompanying phonetic or
04800	linguistic translations are reviewed, and an even simpler procedure
04900	for performing the indicated translations on future unknown utterances.
05000	Only very minor changes have had to be made to the program itself to
05100	adapt it to quite different table schemes, and once a fixed scheme is
05200	chosen, no changes need be made to accommodate any desired arrangement
05300	of table interconnections, and of course no changes are required to
05400	adapt the tables to different speakers.
05500	
05600	Ideally one would like to combine all of the clues that are available
05700	for any particular segment of the speech in attempting to identify its
05800	meaning. This is quite impractical, however, because of both the large
05900	number of clues that are available (the number of dimensions of the
06000	clue space) and the range of values that are required to represent each
06100	clue. If the functional relationship between these variables were known
06200	for the particular speaker, one could make the necessary calculation.
06300	Unfortunately the relationships are not known analytically and since
06400	they vary from speaker to speaker it is impractical to attempt to
06500	learn them. Instead, a subset of the available clues is used to define
06600	an entry in a table and counts are accumulated in this table of the
06700	number of times utterances of particular types are accompanied by
06800	specific combinations of the chosen subset of clues. If the subset is
06900	well chosen and if enough samples have been examined, then numbers
07000	can be computed which are in effect the Bayesian probabilities that
07100	future instances of these particular combinations of clues will
07200	predict intended meanings of the segments of the utterance.
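
The counting scheme just described can be illustrated by a minimal sketch
in Python. It is not the program used in these experiments. It assumes a
single table, clues already quantized to small integers, and a simple
relative-frequency estimate standing in for the Bayesian probabilities;
the report does not give the actual formula. The names SignatureTable,
train, probabilities, recognize and table_size are hypothetical and are
introduced here only for illustration.

```python
from collections import Counter, defaultdict


class SignatureTable:
    """Hypothetical single table keyed by a chosen subset of quantized clues."""

    def __init__(self, clue_names):
        # The subset of clues that together select one table entry.
        self.clue_names = tuple(clue_names)
        # entry -> Counter mapping each utterance type to its observed count.
        self.counts = defaultdict(Counter)

    def _entry(self, clues):
        # Only the chosen subset matters; any other clues are ignored.
        return tuple(clues[name] for name in self.clue_names)

    def train(self, clues, label):
        # Training: count one more co-occurrence of this entry and label.
        self.counts[self._entry(clues)][label] += 1

    def probabilities(self, clues):
        # Assumption: relative frequency within the entry is the estimate.
        seen = self.counts.get(self._entry(clues))
        if not seen:
            return {}  # this combination never occurred in training
        total = sum(seen.values())
        return {label: n / total for label, n in seen.items()}

    def recognize(self, clues):
        # Translation: the most probable label, or None for an unseen entry.
        probs = self.probabilities(clues)
        return max(probs, key=probs.get) if probs else None


def table_size(levels_per_clue):
    # Entries needed if every listed clue indexes one joint table.
    size = 1
    for levels in levels_per_clue:
        size *= levels
    return size
```

The helper table_size shows why a subset must be chosen. Under the
purely illustrative assumption of ten clues quantized to eight levels
each, table_size([8] * 10) gives 8**10 entries, more than a thousand
million, whereas a table over three such clues needs only 512.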